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In this paper, we present a nonparametric estimation procedure for the multivariate Hawkes point 
process. The timeline is cut into bins and—for each component process—the number of points in 
each bin is counted. The distribution of the resulting “bin-count sequences” can be approximated by 
an integer-valued autoregressive model known as the (multivariate) INAR(p) model. We represent 
the INAR(p) model as a standard vector-valued linear autoregressive time series with white-noise 
innovations (VAR(p)). We establish consistency and asymptotic normality for conditional least-squares 
estimation of the VAR(p), respectively, the INAR(p) model. After an appropriate scaling, these time 
series estimates yield estimates for the underlying multivariate Hawkes process as well as formulas for 
their asymptotic distribution. All results are presented in such a way that computer implementation, 
e.g., in R, is straightforward. Simulation studies confirm the effectiveness of our estimation procedure. 
Finally, we present a data example where the method is applied to bivariate event-streams in financial 
limit-order-book data. We fit a bivariate Hawkes model on the joint process of limit and market 
order arrivals. The analysis exhibits a remarkably asymmetric relation between the two component 
processes: incoming market orders excite the limit order flow heavily whereas the market order flow 
is hardly affected by incoming limit orders. 

Keywords: Hawkes process; estimation; integer-valued autoregressive time series; contagion model; 
intraday financial econometrics 

JEL Classification: C13, C14, C22, C51 


1. Introduction 

In this paper, we introduce a nonparametric estimation procedure for the multivariate Hawkes 
point process; see Definition 3.6 for the formal definition and Figure B1 for an illustrative 
summary of the main results. The Hawkes process is a model for event streams. Its alternative 
name, “selfexciting point process”, stems from the fact that any event has the potential to 
generate new events in the future. Our estimator gives substantial information on this excitement: 
nonmonotonicities or regime switches in the excitement of the fitted Hawkes model can be 
detected; the estimates may also help with the choice of parametric excitement-functions. The 
asymptotic distribution of the estimator can be derived so that confidence bounds are at hand. 
Also note that the presented estimation method is numerically less problematic than the standard 
likelihood-approach. Last but not least, the figures generated from the estimation results are a 
graphical tool for representing large univariate and multivariate event datasets in a compact 
and at the same time informative way. In particular, the estimation results can be interpreted 
as measures for interaction and stability of empirical event-streams. This will be highlighted in 
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the data example at the end of the paper where we apply the estimation procedure to the order 
arrival times in an electronic market. 

The Hawkes process was introduced in Hawkes (1971a,b) as a model for event data from 
contagious processes. Theoretical cornerstones of the model are Hawkes (1974), Bremaud and 
Massoulie (1996, 2001), Liniger (2009) and Errais et al. (2010). For a textbook reference that 
covers many aspects of the Hawkes process; see Daley and Vere-Jones (2003). The main theo¬ 
retical reference for the following presentation is our own contribution Kirchner (2015), where 
we show that Hawkes processes can be approximated by certain discrete-time models. 

By the omnipresence of “event”-type data, the Hawkes process has become a popular model 
in many different contexts such as geology, e.g., earthquake modeling in Ogata (1988), internet 
traffic, e.g., youTube clicks in Crane and Sornette (2008), biology, e.g., genome analysis in 
Reynaud-Bouret and Schbath (2010), sociology, e.g., crime data in Mohler et al. (2011), or 
medicine, e.g., virus spreading in Kim (2011). A most active area of scientific activity today is 
hnancial econometrics with applications of Hawkes processes to the modeling of credit defaults 
in Errais et al. (2010), extreme daily returns in Embrechts et al. (2011), market contagion in 
Ai’t-Sahalia et al. (2015) and numerous applications to limit-order-book modeling such as high- 
frequency price jumps in Bacry et al. (2012) and Chavez-Demoulin and McGill (2012), order 
arrivals in Bacry et al. (2011), or joint models for orders and prices on a market microstructure 
level in Muzy and Bacry (2014). Early publications applying the Hawkes model in the financial 
context are Bowsher (2002), Chavez-Demoulin et al. (2005) and McNeil et al. (2005). 

The paper is organized as follows: Section 2 explains how Hawkes processes can be approx¬ 
imated by specific integer-valued time series and how this approximation yields an estimation 
procedure. Section 3 defines the new Hawkes estimator formally and discusses its properties. 
Section 4 refines the procedure by giving methods for a reasonable choice of the estimation pa¬ 
rameters Section 5 presents the data example where the ideas of the paper are applied to the 
analysis of intraday financial data. The last section concludes with a discussion on the implica¬ 
tions of the presented results. Appendix A contains proofs and Appendix B presents illustrating 
hgures. Large parts of the paper are accompanied by examples with simulated data: in favor 
of a linear reading flow, we directly illustrate all new concepts with such examples—instead of 
devoting a separate section to simulations. 


2. Approximation of Hawkes processes 

In this section, after dehning the Hawkes process we introduce autoregressive integer-valued time 
series. We clarify how this model approximates the Hawkes model and how this approximation 
yields an estimation procedure. 


2.1 The Hawkes process 

From a geometric point of view, a Hawkes process specifies a distribution of points on one or 
more lines. Typically, the lines are interpreted as “time” and the points as “events”. Selfexciting 
point process is the common alternative name for the Hawkes process. It highlights the basic 
idea of the model: given an event, the intensity—the expected number of events in one time 
unit—shoots up (“selfexcites”) and then decays (“forgets its past gradually”). The shape of this 
decay is specified by a function, namely the excitement function. The dehnition and the proof 
of existence of a Hawkes process are subtle matters. For rigorous theoretical foundation, we 
refer to Liniger (2009), Chapter 6. We assume a basic underlying probability space (D,P,T"), 
complete and rich enough to carry all random variables involved. On this probability space, we 
define stochastic point-sets "P C M of the form V = {..., T_i, Tq, Ti,... } with < Tfc+i, /c G Z, 
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having almost surely no limit points. Furthermore, we assume that the cr-algebras 


— 


:= fj 


w G n : # (Viuj) n (a, 6]) = n : n G Nq, a < b < t 


t G 


are subsets of By setting 


Np{A) :=#{VnA), AeB{R), 

any stochastic point-set V defines a random measure Np on B(R), the Borel sets of M. At this 
point, we drop the V index; the set V is completely specified by N := Np. In this paper, we call 
a random measure N of this kind point process and we call the filtration [71^) := {'Hf) history 
of the point process. The conditional intensity of a point process N is 


E 


^N{t) ■= lim ■ 
<54.0 


Nit,t + 6)\ni 


N 


, t G 


A Hawkes process is a stationary point process N with conditional intensity 


T^Nit) = rj + / h{t — s)N (ds), tGM. 


( 1 ) 


( 2 ) 


The constant 77 > 0 is called baseline intensity, and the function h : M>o —)• M>o, measurable, is 
called excitement function. Necessary existence-conditions are discussed below. 

For d G N, a d-variate Hawkes process N is a process with d point processes on M as compo¬ 
nents, i.e., N = ..., . Each component process counts points from random point- 

sets Vi C M, ..., Pd C M. In this multivariate setup, the counting processes k = 1,... ,d, 

do not only selfexcite but in general also interact with each other (“crossexcite”). The base¬ 
line intensity ry is a d-variate vector in M>q and the excitement function is a measurable d x d 
matrix-valued function H = (^jj)i<j : IK>o The conditional intensity of a d-variate 

Hawkes process is M>Q-valued with 


E 


AN(i) := lim 
<540 


lHit,t + S)\H^, 


N 


= 77 - 1 - / H{t — s)'N (ds), t G M, 


( 3 ) 


where, for i = 1,..., d. 


a « 

H{t - s)N (ds) I := 1 XI / (ds) 


( 4 ) 


and H^ := G H ; N ((a, 6]) = n|, n G Nq, a < 6 < . In other words, the entry hij{t) 

of the matrix H (t) denotes the effect of any event G Vj in component j on the intensity of 

(i) 

component i at time TH -|- t. See Figure B2 for an example of a bivariate Hawkes process. In 
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Hawkes (1971b), we find the following sufficient condition for existence: if 


spr{K) := max ||/c| : k eigenvalue of matrix i^| 


< 1 , 


( 5 ) 


/oo \ 

where K := I f hij{t)dt ) , then a process with conditional intensity as in (3) exists. The 

^0 / l<i,j<d 

matrix K in (5) is sometimes referred to as branching matrix and the entries of K as branching 
coefficients. These terms reflect an alternative view on the process as a special cluster process 
(Hawkes 1974); 

In each of the components of a d-variate Hawkes process, we observe cluster centers that stem 
from independent homogeneous Poisson processes with rates rji,... ,T]d. These cluster centers are 
also called immigrants or exogenous events. Such an immigrant G M in component j triggers 
d inhomogeneous Poisson processes in components i = 1,... ,d with intensities hij (• — , i = 

1,... ,d. And each of these new points again produces d inhomogeneous Poisson processes in a 
similar way, so that the clusters are built up as a cascade of inhomogeneous Poisson processes. 
The non-immigrant events are called offspring or endogenous events. Disregarding the time 
component and only considering this immigrant-offspring structure, one actually has a branching 
process with immigration, where the number of direct offspring in component i from an event 
in component j is Pois^Kij^ distributed. 


2.2 Parametrization and estimation of Hawkes processes 

In most cases, the data analyst’s choice of the excitement function id of a Hawkes process is a 
somewhat arbitrary parametric function—the main decision being between exponential functions 
or power-law functions. The function parameters are then estimated via standard likelihood 
maximization. Power-law decay of the excitement functions often turns out to be more “realistic” 
in applications; exponential decay yields a likelihood that is numerically easier to handle by 
recursive representation; see Ogata (1988). In addition, exponential excitement functions are 
mathematically attractive because they yield a Markovian structure for the conditional intensity; 
see Errais et al. (2010). Even if the choice between exponential and power-law decay is handled 
carefully, these two functional families cannot catch regime switches or nonmonotonicities of 
excitement functions as in Figure B2. So it seems important to develop methods that can identify 
shapes of excitement in data with less stringent assumptions. Another motivation for our research 
on estimation of the Hawkes model stems from numerical issues—especially encountered in the 
multivariate case. A third gap that we aim to close with our paper is the derivation of the 
asymptotic distribution of the estimates. 

Note that in Bacry et al. (2012) another nonparametric method for the estimation of the 
multivariate Hawkes process is developed; it can be interpreted in our approximation framework. 
We will touch this alternative approach in Section 3.2. 


2.3 Intuition of the approximation 

The main idea is simple: given a (possibly multivariate) Hawkes process, we divide the time 
line into bins of size A > 0 and count the number of events in each bin (for each component). 
These “bin counts” form an No-valued stochastic sequence (Nq- valued in the d-variate case). 
The distribution of this sequence can be approximated by a well-known time series model. We 
present the heuristics behind the approximation in the case of a univariate Hawkes process N 
with baseline intensity ?? > 0 and excitement function h with J hdt < 1. For some A > 0, we 

define the bin-counts := A((n — 1)A, nA), n G Z. We want to argue that for small A > 0 
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and large p G N, we have that 


E 




^ ^n-2’ ■ ■ ■ 


Arj + '^Ah(Ak)xitl 


n € Z. 


k=l 


We divide the approximation above in three separate approximation-steps; 

/ \ 

/ ITT* r A 


E 




E 


A(oi^r„-i)A 


dt 


(n—1)A 


( 2 ) 


(n—1) A 


Ar/ + A J h{nA — u)N (dtt) 


(n—1)A 


Ar] + A 


h{nA — u)N (du) 


(n—p—1)A 


t' 

Ai] + J2Ah(Ak)xl%, „e 


z, 


k=l 


( 7 ) 

( 8 ) 

(9) 


The estimator we are about to present ignores the three approximations above and treats 
them as equalities. In doing so, we make a distributional error (7), a cut-off error (8) and a 
discretization error (8). There is an integer-valued time series that solves the approximative 
bin-count equation (6) to the point: the integer-valued autoregressive model of order p G N, the 
INAR(p) model. 


2.4 The INAR(p) model 

The INAR(p) process was first proposed by Li and Yuan (1991) as a time series model for count 
data. For the history and an exhaustive collection of properties of the model; see da Silva (2005). 
For a textbook reference; see Fokianos and Kedem (2012). The main idea of the construction is 
to manipulate the standard system of autoregressive difference-equations ^ akXn-k = 

En, n G Z” in such a way that its solution (Y„) is integer valued. This is achieved by giving the 
error terms a distribution supported on Nq and substituting all multiplications with independent 
thinning-operations. The following notation from Steutel and van Harn (1979) makes the analogy 
particularly obvious. 

Definition 2.1 : For an No-valued random variable Y and a constant a > 0 define the thinning 
operator o by 


Y 


aoY:= 


fc=i 


(a) 
k ’ 


where ^ 2 °^^ • • • a.re i.i.d. and independent of Y with ~ Poisson(a). We use the convention 

eL. 4”’ = 0- 

We immediately present the multivariate version of the thinning operator and the multivariate 
version of the INAR(p); 
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Definition 2.2: For adxd matrix A = G and an N'^-valued random variable 

X = (Xi, X 2 , ..., X^)'^, define the multivariate thinning operator ® by 


A®X 




\ 


yE^(^d,joXj j 


where the thinnings (ojj o •) operate independently over 1 < f, j < d. 

Definition 2.3: Let d,p G N, G k = I, ... ,p, G M>o and (en)„g 2 an i.i.d. 

sequence of vectors in Nq with mutually independent components Eq,* ~ Pois(ao,i), i = 1,... ,d. 
A d-variate INAR(p) sequence is a stationary sequence of Ng-valued random vectors; 

it is a solution to the system of stochastic difference-equations 


p 

Af^ ® X^ — /;; ® ^ G 

k=l 


where the “®” operate independently over k and n and also independently of (en)- We refer to 
ag as innovation-parameter vector and to Af^, k = 1,2,... ,p, as thinning-coefficient matrices. 

This model has first been considered in Latour (1997). In the same paper we find that if all 
zeros of 


z I—)• det 




z G C, 


( 10 ) 


lie inside the unit circle, then a multivariate INAR(p) process as in Definition 2.3 exists. 

Consider a univariate INAR(p) sequence (A„) with innovation parameter ag and thinning 
coefficients ak, k = 1,... ,p. Note that the criterion from above now simply reads Y%=i < 1- 
Under this condition, we have that X„|X„_i, X„_ 2 ,... ~ Pois (ag -|- Efc=i In partic¬ 

ular, E[X„|cj(A„_i, A„_ 2 , ...)] = ao + Efc=i OikXn-k —which is the exact version of (6). The 
INAR(p) sequence has a similar immigrant-offspring structure as the Hawkes process. In the 
time series case, the (possibly multiple) immigrants at each time step stem from i.i.d. Pois(a;g) 
variables. Each of these immigrants produces Pois(afc) new offspring events at k time steps later. 
Each of these offspring events again serves as parent event for new offspring etc. A more obvious 
choice for the distribution of the counting sequences in Definition 2.1 would be Bernoulli. Note, 
however, that for small thinning coefficients, the Poisson and the Bernoulli approaches are very 
similar. Also note that the Poisson distribution is more convenient for our purpose: we want to 
interpret the INAR(p) model as an approximation of the bin-count sequence of a Hawkes process 
and in the Hawkes model, an event can have potentially more than one direct offspring event 
in a future time-interval. In addition, in the Poisson case, we do not have to exclude thinning 
coefficients larger than one. 


2.5 Approximation of the Hawkes process by the INAR(p) model 

We examine the close relation between Hawkes point processes and INAR time series in Kirchner 
(2015). Eor a particularly obvious parallel, the reader may consider the analogy of the existence 
criteria (5) and (10). Our cited paper gives a precise convergence statement for the univariate 
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case. After establishing existence and uniqueness of the INAR(oo) process as a generalization of 
Dehnition 2.3 with d = 1 and p = oo, we prove 

Theorem 2.4 : Let N be a univariate Hawkes process with baseline intensity r/ > 0 and 
piecewise-continuous excitement function h : M>o —)• M>o such that ^ A < 1 for 

all A G (0,1). Furthermore, let be a univariate INAR(oo) sequence with innovation 

parameter := Ar/ and thinning coefficients := Ah (fcA), A: G N, and define a family of 
point processes by 

iV(^)((a,6]) := a<6, Ag(0,1). 

n: nA£{a,b] 

Then we have that, for A 0, the INAR(oo)-based family of point processes converges 

weakly to the Hawkes process N. 

Proof: This is Theorem 3 in Kirchner (2015). □ 

Note that weak convergence of point processes is equivalent to convergence of the correspond¬ 
ing hnite-dimensional distributions; see Daley and Vere-Jones (2003), Theorem 11.1.VII. The 
other result from Kirchner (2015) that is important for our estimation purpose is the fact that 
INAR(oo) processes can be approximated by INAR(p) processes, p < oo: 

Proposition 2.5: Let (V„) be an INAR(oo) sequence with innovation parameter oq > 0 
and thinning coefficients Ofe > 0, /c G N. Furthermore, let be a corresponding INAR(p) 

sequence, where the thinning coefficients are truncated after the p-th lag. That is, has 

(v) (v) 

innovation parameter Og := oq CLnd thinning coefficients := l{fc<p}afc, A: G N. Then, 
for p ^ oo, the finite-dimensional distributions of converge to the finite-dimensional 

distributions of (V„) • 

Proof: This is Proposition 3 in Kirchner (2015). □ 

We have not worked out the multivariate versions of Theorem 2.4 and Proposition 2.5 above. 
However, the simulations presented further down in the paper support the assumption that 
both results also hold in the multivariate case. Under this assumption, we have the following 
approximation: 

Basic Approximation 

Let N be a d-variate Hawkes with baseline-intensity vector rj and excitement function 
H as in (3). Let be a d-variate INAR(oo) sequence with innovation-parameter 

vector ag^^ := A rj and thinning-coefficient matrices := AH{kA), A: G N. Furthermore, 
for p G N, let be a corresponding INAR(p) sequence with baseline intensity 

p thinning-coefficient matrices := k = l,...,p. Then, for 

small A > 0 and large pA > 0, we have that 

(A(0, A), A(A, 2A),..., A((m - 1)A, mA))) 

~ men. 
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If supp(i7) C [0, s] for some finite s > 0, then the second approximation becomes an equality 
for all p > [s/A]. 


The approximation summarized in the box above is the key observation for our estimation 
procedure : 

(i) Choose a small bin-size A > 0 and calculate the bin-count sequence of the events stemming 
from the Hawkes process. 

(ii) Choose a large support s := pA and fit the approximating INAR(p) model to the bin-count 
sequence via conditional least-squares. 

(iii) Interpret the scaled innovation-parameter estimate slq^’^^/A as the natural candidate for 

an estimate of r/ and, for k G {1, 2 ,... ,p}, interpret the scaled thinning-coefficient matrix 
estimates as natural candidates for estimates of H (A:A). 

Before giving the formal definition of the estimator in the next section, we illustrate the power 
of the presented method in Figure Bl. 


3. The estimator 

In this section, we first discuss estimation of the approximating INAR(p) process. Then we define 
our Hawkes estimator formally and collect some of its properties. Furthermore, we present results 
of a multivariate simulation study that support our approach. 


3.1 Estimation of the INAR(p) model 

There are several possibilities to estimate the parameters of an INAR(p) process. As the mar¬ 
gins are conditionally Poisson distributed, in principle, maximum-likelihood estimation (MLE) 
can be applied. In our context, however, numerical optimization of the likelihood is difficult, 
as the number of model parameters will typically be very large. A method-of-moments type 
estimator would be the Yule-Walker method (YW). A third method is the conditional least- 
squares estimation (CLS). We formulate the estimation in terms of CLS rather than in terms 
of YW for three reasons: (i) For small sample-sizes, CLS is known to have a lower variance 
than YW. (ii) The CLS-estimator allows to present the estimation of excitement function and 
baseline intensity in the same formula, (iii) The asymptotic properties of the YW- and the CLS- 
estimator are the same. Their derivation, however, is typically done in the CLS-setting. In any 
case, even for medium sample-sizes, we note only very small differences between MLE-, YW- 
and CLS-estimates in simulation studies (not illustrated). Inference on CLS-estimation in the 
univariate INAR(p) context has been discussed, e.g., in Li and Yuan (1991) and Zhang et al. 
(2010). In both papers, the reasoning is performed along the lines of Klimko and Nelson (1978), 
which was originally developed for CLS-estimation of time series with the very general structure 
“E [Xn\Xn-i^ • ■ • ] = A„_ 2 , ••.)”, where ge may be nonlinear. However, as already no¬ 

ticed in Latour (1997), INAR(p) sequences can be represented as standard AR(p) models with 
white noise innovation terms. This yields ways for inference that are more direct. 

Proposition 3.1: Let (X„) he a d-dimensional INAR(p) sequence as in Definition 2.3 with 
innovation-parameter vector ao G M>o \{0rf} and thinning-coefficient matrices Aj. G k = 

1, 2,... ,p, such that (10) holds. Then 


Un 


Xn ag ^ ^ Afc X^_fc, 


k=l 
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defines a (dependent) white-noise sequenee, i.e., (u„) is stationary, Eu„ = 0^, n G Z, and 


E 


UnU^, 


diag - Y. 

Odxd, 




n = n 


n n!. 


Proof: This can be shown by straightforward (if lengthy) calculations; see Appendix A.l. □ 

As a consequence of Proposition 3.1, a d-variate INAR(p) process can be represented as a 
standard d-variate autoregressive time series with (dependent) white-noise errors: 

Corollary 3.2: Let (X„) be the multivariate INAR(p) sequence and (u„) the white-noise 
sequence from Proposition 3.1. Then (X„) solves the system of stochastic difference-equations 

p 

Xn = ao -|- Afc X„_/j -|-Un, n G Z . 

k=l 


Such vector-valued time series with linear autoregressive structure have early on been exam¬ 
ined; see, e.g., Hannan (1970). However, estimation in a multivariate context requires cumber¬ 
some notation. In order to make our results comparable, we follow one reference throughout, 
namely the monograph Liitkepohl (2005). Adapting its notation is also the reason why we work 
with wide matrices—i.e., matrices having a number of columns in the order of the sample size— 
instead of the more common long matrices. 

Definition 3.3: Let {xk)k£n be an valued sequence, where we interpret as a column 
vector. Fix p and n G N, p < n, and define the multivariate conditional least-squares estimator 
as 


oipw) . Tadxn 
^CLS ■ ^ 


bdx (dp+l) 


(Xl, 




qiPW) 

^CLS 


(xi,. 




:=YZ' ZZ 


-1 


where 


Z (xi,... ,x„) 



Xp 

Xp+1 ... 

X„-l 


Xp-1 

Xp . . . 

X„-2 


Xl 

X2 . . . 

Xn—p 

V 

1 

1 ... 

1 / 




is the design matrix and Y (xi,..., x„) := (xp+i,Xp_|_ 2 ,..., x„) G . 

Dealing with multivariate time series the following notations turn out to be useful: 

Definition 3.4: The vec{ )-operator takes a matrix as its argument and stacks its columns. The 
binary ( 8 )-operator is the Kronecker operator: for an mxn matrix A = (ajj) and apxq matrix B, 
{A0 B) is the mp x nq matrix consisting of the block-matrices aijB, i = 1,... ,m, j = 1,..., n. 

The vec-notation arises because the estimator is matrix-valued and we have no notion of the 
covariance of a random matrix. As we will see the ( 8 )-notation is strongly related to the vec- 
operator. For a large collection of properties of these operators; see Appendix A of Liitkepohl 
(2005). The following theorem collects all relevant information for CLS-estimation of multivariate 
INAR(p) sequences. Together with the approximation results from Section 2.5, this theorem is 
the theoretical basis for our Hawkes estimation procedure. 
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Theorem 3.5: Let (X„) be a d-dimensional INAR(p) sequence as in Definition 2.3 with 
innovation-parameter (column) vector ao G M>o \{0rf} and thinning-coefficient matrices G 
/c G {1, 2,... ,p}, such that spr (X]fc=i ^k) < 1- 

B := {Ai,A 2 ,...,Ap,ao) eM‘='x(^p+i) and 
;= 9^SLk(^k)k=i,...,n) G 

the CLS-estimator with respect to the sample (X.k)k=i,...,n- Then a is a weakly consistent 
estimator for B. Furthermore, let Z be the design matrix from Definition 3.3 with respect to 
{^k)k=i,...,n- Assume that the limit 

-J— Z Z^ r G M(<ip+i)x((ip+i)^ n —^ oo, (11) 

n — p 


exists and is invertible. In addition, assume that the model is irreducible in the sense that 
P[Xo,i = 0] < 1, i = 1, 2,..., d. Then, for the asymptotic distribution o/vec ^B^ G 
one has, for n —)• oo, 

-v/rP^^vec(B^”^) — vec(B)^ 

Md'^p+diOd'^p+d, (r~^ <8) Idxd) W (r“^ (g) Idxd) ) , 


where 


IT :=E 


Zo ®ldxd) Uo (( Zo (8)ldxd)uo 


g ^{d'^p+d,)x{d?p+d) 


( 12 ) 


with 


p 

uo := Xo -ao — ^ Ak X_fc 
k=l 


and Zn := ( X 




,xTj„i 


T 


Proof: In view of Corollary 3.2, it suffices to prove Theorem 3.5 for the corresponding vector¬ 
valued autoregressive time series. So the distributional properties of the CLS-estimator can be 
derived similarly as in Liitkepohl (2005), pages 70-75, where independent errors are assumed. 
We provide a highly self-contained proof for the dependent white-noise case in Appendix A.2. □ 

Note that the condition P[Xo,i = 0] < 1, i = 1, 2,..., d, in Theorem 3.5 above is purely 
technical: if we had P[Xo,i(, = 0] = 1 for some io, this would imply that in one component of our 
sample we cannot observe any events. We may exclude this case with a clear conscience. 


3.2 The Hawkes estimator 

Combining Theorem 3.5 with the basic approximation from Section 2.5 yields the following 
estimator for multivariate Hawkes processes: 

Definition 3.6: Let N = ^ N^‘^\ ..., be a d-variate Hawkes process with baseline- 

intensity vector p G M>q \{0rf} and excitement function H = (hij) : M>o — )• such that 

(5) holds. Let T > 0 and consider a sample of the process on the time interval (0,T]. For some 
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A > 0, construct the Ng-valued bin-count sequence from this sample; 

X(^) := ('iV(^)(((fc-l)A,A:A])') , = 1, 2,..., n := [T/AJ . (13) 

V ^ Vi=i,...,d 

Define the multivariate Hawkes estimator with respect to some support s, A < s < T, by 
applying the CLS-operator from Definition 3.3 with maximal lagp := [s/A] on these bin-counts: 


(( 4 "'),.,.„) ■ <“) 

We collect the main properties of the estimator in the following remark. 

Remark 1 : The following additional notation clarifies what the entries of the matrix 

actually estimate: 






From Theorem 3.5 on estimation of INAR(p) sequences together with the basic approximation 
in Section 2.5, we see that, for 0 < t < s, 

(^LRAJ ) ’ b i = • • • > respectively, ^ ’ f = 1 ,..., d, 

are weakly consistent estimates (for T —)• oo, A —)■ 0 and s = Ap —)• oo) for the excitement- 
function component value hij{t), respectively, for the baseline-intensity vector component rj^. 
Furthermore, we find from Theorem 3.5 that 


vec Afd^p+d (vec (H) , (F"! ® l^xd) W (F"^ ® , (15) 

where F and W are defined as in (11) and (12) with respect to the bin-count sequences. Substi¬ 
tuting F and W with their empirical versions yields the covariance estimate 

52 ^ ( (Z Z^) ® Idxd) (Z Z^) ® Idxd) , (16) 

' k=p+i ^ ’ 


where Z is the design matrix from Definition 3.3 with respect to the bin-count sequence and, for 
k = p^ 1,P + 2,... ,n. 


Wfc : = 




T 



AH, 


(A,s) 


x. 


k-l 


1=1 


Following formulas are useful for implementation of confidence intervals: 


Cov 




q2 

‘^{ki-l)d'^+{ji-l)d+ii, {k2-l)d‘^ + {j2-l)d+i2 ’ 


11 



September 8, 2015 


for ii,i 2 ,jij 2 G {1,... ,<i} and ki,k 2 G {1, • • • ,p}. 

Cov = Spd 2 +i,,pd^p+i,, for ii,i 2 G {1, 

Applying Remark 1 above together with Definitions 3.3 and 3.6, our Hawkes estimation proce¬ 
dure may be implemented in a straightforward manner. However, we emphasize that the resulting 
matrix in (14) does not completely specify a fitted Hawkes model; it only yields pointwise 

estimates on a grid, whereas the true excitement-parameter is a function on M>o; see Section 2.1. 
To complete the estimation, we have to apply some kind of smoothing method over the pointwise 
estimated values. We work with cubic splines, normal kernel smoothers and local polynomial re¬ 
gression (ksmoothO, smooth, spline0 and loessO in R). We find that the results do not vary 
significantly. The choice of the estimation parameters bin-size A and support s has more impact. 
Therefore, we focus on the selection of these estimation parameters; see Section 4.. The smooth¬ 
ing idea will be relevant in Section 4.2, where we discuss variance issues. In many applications, 
one can even avoid choosing and applying a smoothing method: practitioners might want to 
use our estimation procedure from Definition 3.6 for identifying or rejecting certain parametric 
models. For such purposes, the pointwise estimates suffice. The same is true if the estimation 
procedure is used as a mere tool for representing large event datasets; see Section 5.3. Finally, 
one is often only interested in the integral of the excitement; see comments after (5). In this 
case, it makes more sense to directly add up the estimates rather than to take the detour over 
some smoothing method. 

Finally, we refer to the alternative nonparametric Hawkes estimation approach from Bacry 
et al. (2012). Here, an implicit equation for the autocovariance density of a Hawkes process is 
solved for the excitement function by Fourier analysis. This approach corresponds to directly 
applying YW-estimation to the bin-count sequences—if somewhat in disguise. Our procedure 
highlights the underlying approximation principle. This explicit connection with powerful time 
series theory seems more fertile than the manipulations in Fourier space; it is more intuitive, 
simpler to implement and yields much simpler ways for inference. 


3.3 Simulation studies 

We check the distributional properties of the Hawkes estimator collected in Remark 1 in a 
simulation study. The results are summarized in Figure B3. We simulate 2 000 times from a 
bivariate Hawkes model with baseline intensity t] = {rji,r] 2 )~^ = (0.5,0.25)"'^ and excitement 
function 


rr(,. _ [hi,lit) hl,2it) \ _ ( 0 ll<i<30.25 \ . 

^ \h 2 ,i{t) h 2 , 2 it) ) \^0.5(1lt<vr0.2 sin(t) ) ’ 


(17) 


see Figure B2 for this parametrization and Figure B1 for an estimation of a single realization. In 
each simulation, about 5 000 events in each component are generated and our Hawkes estimator 
(14) is calculated. We apply a bin size A = 0.2 and a support parameter s = 6. These calculations 
yield 2 000 matrices of the form G We examine the estimations of rji = 0.5, i.e., 

the baseline-intensity for the first component, and the estimations of /i 2 ,i(l) = 0.125, i.e., the 
crossexcitement on component 2 from component 1 after one time unit. These values correspond 
to the entries and in the estimator matrices. We find that the 2 000 estimates 

are distributed symmetrically around the true values. The means of the estimates correspond 
almost completely to the true values. QQ-plots support the asymptotic normality result. For 
both estimations, we also calculate the variance estimates from (16). Comparing the empirical 
variance of the 2 000 estimates with the 2 000 estimated variances confirms the analytic result. 
Furthermore, the empirical covering rates for the 95%-confidence intervals are 94.5% for the 
baseline-intensity estimate, respectively, 94.8% for the excitement-value estimate. Note that the 
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applied estimation parameters A = 0.2 and s = 6 are considerably “wrong”: the bin-size is quite 
large and the true support of H would be oo. We may interpret the successful estimation as a 
sign for the robustness of the method with respect to the estimation parameters. 

Separately, we examine the impact of the choice of the bin-size A, the support s and the 
size of the sample window [0, T] on the variances of the estimates; see Figure B4. For various 
A, s and T, we calculate (16), the estimated covariance of the estimator matrix with respect 
to a single very large sample. We find that the excitation and baseline estimation variances 
with respect to sample windows [0, T] are proportional to T~^. Variances slightly increase if we 
increase the support parameter s. The variance of the baseline intensity estimate with respect 
to A is roughly constant in A. However, the variance of the excitement estimate with respect 
to A is proportional to A“^. Albeit this relation, we will see in Section 4.2 that the excitement 
estimates are still meaningful for very small values of A. 


4. Refinements 

Our Hawkes estimator ^ from Definition 3.6 depends on a bin size A > 0 and on a support 
s > 0. In the following section, we present procedures for sensible choices of these parameters. 
Furthermore, we discuss numerical and diagnostic issues. 


4.1 The choice of support 

Estimating the support of the excitement function of a Hawkes process corresponds to estimating 
the largest lag of a nonzero thinning-coefficient (matrix) of the approximating INAR sequence. 
In view of the VAR(p) representation of INAR(p) sequences from Corollary 3.2, we can use any 
model-selection procedure stemming from traditional time series analysis; see Chapter 4.3 of 
Liitkepohl (2005) for an overview of such procedures in the multivariate context. For comparison 
of different order-selection methods for univariate INAR(p) sequences; see da Silva (2005) . As 
a most common example, we apply Akaike’s information criterion (AIC); see Akaike (1973). 

We work in the setup of Definition 3.2. The starting point is a sample of a d-variate Hawkes 
process on [0, T] together with a preliminary bin-size Aq > 0. In our experience, a preliminary 
bin-size such that there is about one event in average per bin and component is a good choice. 
With respect to this Aq, we calculate the bin-count sequence(s) as in (13). Now let no := [T/AoJ 
and So > 0 be some very large support, e.g., so = T/10. Then, for p € {1,2,, [so/Aq]}, we 
calculate Akaike’s information criterion 


AIC(p) := log (deti](p)) -F , (18) 

V / no — p 

where S(p) := l/(no - p) YT=p+i • Here, with the notation from Remark 1, 

p 

4'’^ := Xfc -Ao77^^“’*) - AoH4“’"^ k = p+l,p + 2,...,no, 

1=1 

denote the estimated prediction-error vectors with estimated coefficient-matrices from the fit of 
the approximating INAR(p)-model with respect to p lags; see Liitkepohl (2005) for the mul¬ 
tivariate AlC-formula (18). Finally, we choose := argminp <|-^^/^^1 AIC(p) as estimated 

maximal lag for the approximating INAR model, respectively, we choose := as 

support parameter in the calculation of the Hawkes estimator (14). 

In parametric Hawkes setups, the support of the Hawkes excitement function is typically chosen 
infinite. Our estimation procedure, however, assumes finite excitement. This is not a big issue: 
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from (5), we get that the excitement-function components vanish for large times. In other words, 
the influence of the tail of the excitement on the model is negligible; see (8) and Proposition 2.5. 
The only question remaining is how we can choose a support p G N, respectively, s > 0, large 
enough so that the truncated model with the truncated excitement is a good approximation for 
the true model. For this choice also, the AlC-approach from above can help. Applying AIC- 
selection naively on data generated from a Hawkes model with infinite support yields support 
values so that the truncated part of the true excitement functions vanishes; see the simulation 
study below. As a side remark note that also the standard parametric approach suffers from 
cut-off errors as we only observe data in finite time-windows. 

We perform a simulation study for the support-choice methods described above. These two 
univariate cases are summarized in Figure B6. We consider different Hawkes models: univariate 
and bivariate processes, both with finite as well as with infinite excitement-function support. 
For the univariate case, we first simulate from a Hawkes model with excitement function h{t) := 
exp(—t)li< 3 . and calculate the AlC-minimizing support with respect to different preliminary 
Aq. All results catch the true support very well. Next, we consider the case of infinite support 
t I—)■ exp(—at) with respect to three different decay parameters a G {1.1,1.5, 2}. Again, we 
simulate large samples for each of the three a-values and then calculate the three corresponding 
AlC-minimizing support estimates. The smaller a, the larger the tail of the true excitement 
function becomes and—as desired—the larger the support estimates get. For all choices 

of a, the ignored excitement weights, /j(aic)(q,) exp(—at)dt, are very small (less than 10“^). 

Secondly, we consider a bivariate Hawkes model with the excitement function H from (17). We 
realize a single large sample from this model. Then we simulate another sample from a truncated 
version of the model, with 


^b2(t)\ ^ ( 0 li<t<30.25 \ 

\h^2Ait)h2,2{t)J + lt<n0.2sm{t) J ■ 

The AlC-minimizing support estimate is 9.5 for the original model and 4.2 for the truncated 
model. So the AlC-approach is able to discriminate between these cases. 


4.2 Choice of bin size 

In the following, we discuss the choice of the bin size A > 0 for the Hawkes estimator ^ from 
Definition 3.6. We suppose a reasonable support s > 0 has already been chosen by a procedure as 
described in Section 4.1. One can interpret the choice of the bin size A as a bias/variance trade¬ 
off: the smaller A, the smaller the potential bias stemming from the model approximation, i.e., 
the smaller the errors (7) and (9). At the same time, due to the 1/A factor in the calculation of 
the estimator matrix from (14), its (componentwise) variance increases when A decreases. 

In a simulation study, we simulate 100 times from a Hawkes model with excitement function 
h{t) = exp(—l.lt). For each sample, we calculate the Hawkes estimator with respect to three 
different bin sizes A G {0.1,0.5,1}. Figure B5 collects the estimation results in boxplots. The 
bias/variance trade-off is obvious. Note, however, that we had to choose the bin-size quite large to 
make the bias visible at all. Concerning the large variance, we should keep in mind that the final 
goal may be an estimation for the excitement-function components hij —and not only a finite 
number of their values hij{kA), k = 1, 2,... ,p. When we apply some smoothing method on these 
values, a smaller A typically leads to an “averaging” over more point estimates. This balances 
the increase in pointwise variance. So if the goal of the estimation procedure is a completely 
specified Hawkes model, then the smallest A that is numerically convenient may be chosen; see 
the discussion after Remark 1. The following toy-example clarifies things: 

We consider a smoothing method for which we can approximately calculate the variance. 
Namely, we define a box moving-average with window size r > 0. For some bin size A > 0, 
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we consider a univariate Hawkes selfexcitement estimate ^ = {h^^\ ..., h^\ As a 

smoothed function-estimate, we set 


: t 


1 -^ 


#[k : kA G [t ± t/2]} 


E 


h 


(A) 


k: fcAG[tdir/2] 


t > 0. 


Then, using that Var ss c/A, see Figure B4, Cov{h\^\h\^^) 0, k ^ I, and that 

#{/c : kA G [t ± t/2]} PS t/A, we obtain 


Var 


1 

VJW 


Var 

k: feAs[t±'r/2] 



C 

r 


In other words, the variance of the smoothed estimate is approximately constant in A. The same 
effect can be observed empirically with more sophisticated smoothing methods: for a single large 
simulated sample from a Hawkes process, we calculate the pointwise Hawkes estimator from 
Definition 3.6 with respect to three different bin-sizes A. Applying a cubic smoothing-spline 
procedure on the three results one findes that the smoothed functions are hardly affected by the 
choice of A. 

If we choose a very small bin-size A, computation time becomes an issue. The calculations 
in (14) require the construction of the design matrix Z from Definition 3.3 with about T/A 
rows and about d ■ s/A columns. Here, T is the size of the time window, d is the dimension 
of the process and s is the support parameter of the estimation. Then the matrix Z Z^ has to 
be inverted. This square matrix is approximately of size \d ■ s/A] x \d ■ s/A], In short, the 
smaller A, the larger the matrices involved. Note, however, that, for a very small bin-size A, 
the corresponding design-matrix is very sparse. Specialized software makes construction and 
manipulation of sparse matrices numerically efficient; see Bates and Maechler (2015). 

We now understand that the trade-off related to the bin-size choice is not so much a 
bias/variance trade-off but more a bias/numerical-issues trade-off! To check, if we have cho¬ 
sen A small enough, we propose to calculate the (biased) estimate of the baseline intensity 
vector 7] := (?7j)i<j<rf for a decreasing sequence of bin sizes Aq > Ai > A 2 > ... The vari¬ 
ance of the estimates n = 0,1,... is approximately constant over the different bin-sizes; 

see Figure B4. This makes the estimates comparable. For i = 1,2,... ,d, we plot the values 
/^(sAn)\ against (Aji)^^g ^ . Typically, one observes a monotone convergence in n to 

some constant (or d constants for d > 1). Plotting confidence intervals around the point esti¬ 
mates indicates when the bias is negligible in comparison to the random noise of the estimate. 
We will apply this method in the concluding data-example. 


4.3 Diagnostics 

We see a certain danger in the application of our nonparametric Hawkes estimator from Def¬ 
inition 3.6. Reasonable graphical results as in Figure B1 might be used as an argument in 
favor of the Hawkes process as the true model. But this conclusion would be a misuse of the 
method. In fact, the proposed estimator depends only on second-order properties of the data. 
So, we have to expect that there is a whole family of point processes that generate the same 
excitement estimates, although only one of these processes is a genuine Hawkes process. As an 
example, consider a continuous-time, nonnegative, stationary Markov chain that has the same 
second-order properties as some given Hawkes process. We use this Markov chain as a stochastic 
intensity for another point process; see Daley and Vere-Jones (2003), Example 10.3(e). The re¬ 
sulting doubly-stochastic point process is a point process with different distributional properties 
than the corresponding Hawkes process. But our estimator will still yield the same results in 
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both cases. As another example, consider a time-reversed Hawkes process. Clearly, this is not 
a Hawkes process anymore. However, the time-reversed version has the same autocovariance 
density as the original process and therefore our estimator will again yield the same result. 

This means, the application of our estimation approach always ought to be followed by a 
model test. A most common basis for such a test in our context is a multivariate version 
of the random time-change theorem for point processes; see Meyer (1971), Brown and Nair 
(1988): for points i = ^7 ■ ■ ■ jd, from a d-variate point process with conditional inten- 

sity A = one has that AW (t) dt Exp(l) independently over i = 1,... ,d and 

k G Z . So, after having fit the Hawkes process to point process data, we calculate the correspond¬ 
ing conditional-intensity estimate and time-transform the interarrival times. These transformed 
interarrival times ought to be compared with theoretical Exp(l)-quantiles in a QQ-plot. Next 
to this graphical method one ought to apply a Kolmogorov-Smirnov test and an independence 
test to the transformed interarrival times. 


5. Data application 

There are two contexts of growing importance where large event-datasets are not the excep¬ 
tion but the rule: internet traffic and high-frequency data in financial econometrics. The paper 
concludes with an exemplary application of the estimation procedure to the latter. 


5.1 The data 

The data we use stem from the limit order book (LOB) of an electronic market. LOBs match 
buyers and sellers of a specific asset. We will consider a certain future contract. Whoever wants 
to buy or sell one or several of these contracts has to send his or her orders to the LOB. An 
order basically consists of two pieces of information: it names (a) the maximal (respectively, 
minimal) price at which the sender is willing to buy (respectively, to sell) and (b) the desired 
quantity in terms of numbers of contracts. If the order is matched to another order, the trade 
is executed. Such orders that immediately find counterparts are called market orders. All other 
incoming orders are stacked in the LOB; these are called limit orders. Limit orders either wait 
for getting executed by a new incoming matching (market) order or—and this happens relatively 
often—they are withdrawn after some time. The empirical process of time points when orders 
arrive we call order flow. Such an order flow can be modeled by a point process. In particular, our 
estimation method from Definition 3.6 allows to analyze the order flow in a Hawkes setup. Eor a 
detailed survey of order-book quantitative analysis; see Gould et al. (2013). Financial intraday 
histories are attractive for econometric research as there is so much data available. However, by 
the very differing data qualities, results are sometimes hard to compare. To clarify our starting 
point, we explain the context and the preparation of the data quite detailed. 

We consider a sample of the LOB of E-mini S&P 500 futures with most current maturity. 

The enormous liquidity makes the data attractive for quantitative analysis. Samples of these 
particular data have also been analyzed in the Hawkes setting, e.g., by Filiminov and Sornette 
(2012) and Hardiman et al. (2013). Our particular data sample was provided by TickData inc. 
It stems from September 2013. We have a separate dataset for quotes and for trades. A new 
entry in the quotes data corresponds to one of the following three events: 

(i) Arrival of some (not marketable) limit order 

(ii) Arrival of some market order, i.e., a trade takes place 
(hi) Cancelation of some limit orders 

In the trade data set, we see the traded price and the number of contracts traded. 

In both datasets, we observe ties, i.e., multiple events with identical millisecond time-stamps. 


16 



September 8, 2015 


These ties require special consideration as our model, the Hawkes model, does not allow for 
simultaneous jumps. As data is so relatively sparse, the multiple events cannot be accidental. 
This leaves two possibilities: either the multiples stem from a single order that has been split (for 
some technical reason) or the multiples are almost instantaneous responses to each other that are 
reported at the same millisecond due to rounding. We may safely rule out this second possibility: 
as yet, it is impossible to “react” (i.e.: observe an order, send an order, let the electronic book 
record or match the new order) in less than a millisecond. In addition, we had the opportunity to 
compare our data with a snapshot of the fully reconstructed LOB. This complete data provide 
“match tags” for each order. This additional information shows that nearly all multiple events are 
in fact orders from one single market-participant. This confirms our point of view. We therefore 
consider each time stamp in the datasets only once. After the thinning procedure, we derive two 
one-dimensional event datasets from our data: 

• the trade data T and 

• the (pure) limit-order data C that collects all the times when a new non-marketable limit 
order has arrived or a limit order has been canceled. 

In busy trading hours, i.e., between 8:30am and 3:15pm (Chicago time), we observe about 5 
events per second in the trade data T, and about 12 events per second in the limit-order data 
C. At Chicago night time, all of these average intensities are up to twenty times smaller. All 
interarrival-times processes exhibit significant autocorrelation at large lags. This rules out simple 
standard homogenous Poisson point processes as models as well as other renewal processes. On 
the other hand, the autocorrelation may also stem from nonstationarities of the underlying true 
model; see Mikosch and Starica (2000). 


5.2 Bivariate estimation of the market/limit order process 

With our nonparametric method from Dehnition 3.6, we ht a bivariate Hawkes process 

(Ar(r)^jV(/:)) 

on a single 30min-sample of the data {T,C), namely on data from Friday, 
2013/09/06, 10:00am-10:30am (Chicago time). In this specific sample, we observe about 20 000 
trades and 40 000 limit orders. Our estimation procedure from Section 3.2 depends on a choice 
of support and on a choice of bin size. For a sensible choice of these parameters, we apply the 
methods from Sections 4.1 and 4.2: 

As a first step, we calculate the Hawkes estimator with respect to a relatively large preliminary 
bin-size of Aq = 0.5 sec and for various support candidates between 1 and 300 seconds. As 
proposed in Section 4.1, we compare the corresponding AIC-values. This coarse analysis shows 
that the AlC-optimal support is surely less than 20 seconds. Repeating the analysis with respect 
to a much hner bin-size Aq = 0.01 sec on the interval (0 sec, 20 sec), we find an AlC-minimizing 
support of about 2.8 sec. Let us note that the obtained minimum is much more clear-cut than 
in the controlled simulation study from Section 4.1 illustrated in Figure B6. We set s = 3 sec. 

In other words, our support analysis indicates that the process forgets its past after three 
seconds. This preliminary result is already interesting: it can be interpreted such that—in this 
sample—the algorithms that drive the market only take not more than the last three seconds of 
the LOB-history into account. 

For a reasonable choice of the bin-size parameter A, we apply the method from Section 4.2. 
That is, we examine the impact of the bin-size choice on the estimation. We leave the support 
s = 3 sec fixed and, for different bin-size candidates A, calculate the baseline-intensity estimate 
i = 1,2, together with the corresponding confidence intervals; see (14) and (15) for the 
necessary calculations. We observe a monotone relation between the bin-size candidates and the 
corresponding baseline-estimates. However, for A < 0.01 sec, the differences of the estimates are 
of a lower order than their (estimated) confidence intervals. So it is sensible to assume that, for 
this particular sample, the bias of our estimation method becomes negligible for bin-size choices 
of A < 0.01 sec. From the bivariate event dataset, we finally calculate the Hawkes estimator from 
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Definition 3.6 with respect to support s = 3 sec and bin size A = 0.01 sec. Figure B8 summarizes 
the estimation results for this specific time thirty minute window. The baseline intensity of the 
limit-order process C is about four times larger than the baseline intensity of trades process T. 
In both processes, we observe a strong and quite similar selfexcitement. The crossexcitement, 
however, is obviously directed: we observe a very strong crossexcitement from T on £, but hardly 
an effect from C on T. The estimated interactions can be summarized in the branching-matrix 
estimate 


/0.62(±0.04) 0.03(±0.01)\ . 

\^0.55(±0.06) 0.54(±0.03 )) 

See Remark 1 for the calculation of the point estimates as well as the 95%-confidence bounds of 
the branching-matrix components. Also see the explanations after (5) for the interpretation of the 
branching-matrix that is indicated in the right matrix. The largest eigenvalue of matrix (19), i.e., 
the stability-criterion estimate, is 0.72. The strong asymmetry in (19) may be interpreted such 
that the trades cause the limit orders (and cancelations) and not vice versa. In further analysis, 
we found that the estimated branching-matrix, and in particular the asymmetric crossexcitement, 
is quite stable over all thirty minute windows of the busy trading hours (not illustrated). In the 
crossexcitement from T on T, we observe local maxima at half and whole seconds. This effect 
may have two causes: it reflects a preference either for absolute or for absolute round times. To 
put it differently: some of the order-sending algorithms that indeed react on trade events may 
have an implemented lag of half or full seconds. 

In a second approach, we fit the Hawkes model to the same sample as above. This time 
however, we ignore the best support choice and set it naively to s = 0.1 sec only. In addition, we 
apply an extremely small bin size of A = 0.002 sec. In the first milliseconds after each event, the 
results indicate an inhibitory effect; the Hawkes model does not allow for negative excitement. 
Still, this result is not surprising. It reflects the fact that it takes at least 3 or 4 milliseconds 
for any market participant to observe and react to a change in the LOB. In the smoothed 
function-estimate of the selfexcitement of the first component (the trades process), we detect 
local maxima at 0.1 sec multiples. As above, this may be is a sign, that 0.1 sec is the “resolution” 
of some of the algorithms that drive the market. Also note that in this naive fit, the baseline- 
intensity estimates are much larger than in the first fit: these large values are a compensation 
for the too small support choice. 

Naturally, the fitted Hawkes model is only completely specified when we smooth the results 
from the estimation method on the grid by some kind of smoothing mechanism that yields a func¬ 
tion H : M>o —)• We do this with a cubic smoothing spline method. Having thus completely 

specified the model, we apply a Kolmogoroff-Smirnov test on the transformed interarrival-times; 
see Section 4.3. The test rejects the fitted model for the 30 min-window. This is not surprising: 
given the very large sample-size, we are very likely to include “abnormal” interarrival times that 
our model cannot catch; the Kolmogoroff-Smirnov test is particularly sensitive to such outliers. 
Dividing the 30 min-windows into smaller samples of 100 events yields plausible p-values (not 
illustrated). For further interpretation of the diagnostics; see the discussion in Section 5.3 below. 


V0.62 


T C 


0.03 


^ 0.55 ^ ^ 0.54 . 
/ L L L 


(19) 


5.3 Interpretation of the estimation results 

The interpretation of the estimation results from Section 5.2 is not straightforward: observing the 
income of an order makes people (respectively, algorithms) send other orders. In this sense, we 
may expect some quite direct true excitement in LOB data. In our modeling approach however, 
any fluctuation of exogenous processes that influence the observed event-process will also be 
detected as selfexcitement. Candidates for such covariate processes in our context are volume, 
arrival of orders away from the best bid or best ask price, spread, or even data from other assets 
such as options on S&P 500 E-mini futures. A most natural way to model this situation would be 
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a joint multivariate Hawkes model. However, doing statistics with so little knowledge about the 
state (or even the dimensionality) of the process will yield new problems. So the best way to get 
rid of artificial selfexcitement in the Hawkes model is presumably to make the baseline intensity 
more flexible. For an example of such a Hawkes model with stochastic baseline-intensity; see 
Zhao (2012). To summarize: our estimation method can indeed detect self- or crossexcitement 
in data. However, we ought to be careful with interpretation of these terms. 

The Hawkes fit is meaningful and fertile despite of the criticism above and despite of the 
vanishing p-values in our application: plots of excitement estimates as in Figure B8 are visu¬ 
alizations of huge event-datasets in a compact and at the same time informative way. In that 
sense, any Hawkes fit—and our estimation method in particular—can be used as a graphical tool 
for exploratory event-stream analysis. Furthermore, even if the Hawkes model assumption may 
be completely wrong, an excitement-function estimate hij{-) is also theoretically meaningful. It 
is an estimate for the best linear filter of E [iVj({t})/dt|(T (A(j({s}), s < t)] which is a relevant 
quantity in all stationary models. 


6. Conclusion 

This paper demonstrates that applying methods from time series theory to the bin-count se¬ 
quences of point process data yields a useful and intuitive nonparametric estimation method for 
the multivariate Hawkes process. The price for the fertile simplicity of the method is a bias due 
to the discretization involved. Simulation studies support that this bias can be controlled and 
that it is negligible for most practical means. The technique presented depends on the choice 
of the bin size and the assumed support of the excitement function(s). Methods for a sensible 
choice of these parameters are given. In any application, the robustness with respect to these 
choices ought to be studied. 

Due to space constraints, the presentation leaves out obvious subsequent topics. Confidence 
hounds for the rate of endogeneity (the branching parameter of a univariate Hawkes process), 
estimation of the excitement function on a nonequidistant estimation-grid, derivation of power- 
law deeay-parameter estimates in the parametric case via a linear regression on the log/log values 
of the pointwise estimates, and estimation of marked Hawkes processes: using the concept of our 
paper as a starting point, all these aims can be achieved in a straightforward manner. 

Finally, note that in view of the analogy between discrete-time INAR(p) sequences and 
continuous-time Hawkes processes, analysts using the Hawkes model may consider to directly 
apply the INAR(p) model in the first place—as most event data live on relatively discrete time 
grids. 
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Appendix A: Proofs 

A.l Proof of Proposition 3.1 

First, we establish that 


p 

Un .— Xn, 3-0 ^ ^ '^n—k: kl G Z, 

k=l 


dehnes a white noise sequence. Stationarity of (u„) follows from the stationarity of (Xjj). For 
the sequel of the proof, fix any n G Z. From the property E [A ® X„] = AEX„, A G of 

the thinning operation from Definition 2.2 we get 


E u„ = E 


= E 


^n—k 


Xn 30 ^ ^ X. 

k=l 
P 

Ak ® ^n-k + £n ~ 3o — ^ ^ Afc X. 


n—k 


Lfc=l 


k=l 


P P 

= 'y ^ Ak E [Xjj-fc] -|- 3o — 3o — y ^ Ay E [X„_fc] 
k=l k=l 


= 0d. 
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For the autocovariances of the sequence (u„), first note that the error u„ is uncorrelated with 
any previous value X.n',n' < n, of the original sequence. Indeed: 


E 

Un Xjj/ 

= E 

- 1 

1 

O 

1 

3 H 

_1 




L \ k=i / J 


E 

(X„-E[X„ 

iu(x„_i,x„_2,...)])x;[, 

E 

x„xT 

-E 

E [X„ X^, O' (X„_i, X„_2, 


= Orfxd- (Al) 


So that we get, for n' < n (and then, by symmetry, for n' / n). 





/ P \ 

E 

UnU^' 

= E 

l^n I ^^n' ^0 / ^ -A/c ^^n'—k 1 




\ k=l / 


= E 


u„ XJ, 


-E 


Un ag 


p 

AfcE 

k=l 


U-n X 


T 

n' — k 


Tn, r 1 T 

= -E[u„Jao 

= Orfxd- 


We have established that (u„) is a white noise sequence. In a second step, we derive its marginal 
covariance matrix. Since E u„ = 0^ and E = O^xd; /c = 1, ... ,p, we obtain 


E 


= E 




U-n X^ ag ^ ^ A]^ X. 


n—k 


k=l 




= E 


= E 




T' 


^n—k 




^0 ^ ^ -^k^n—k | f Y^Ak®X. 
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+ En ( ^ Afc @ X„_fc I +'^Ak® X„_fc + '^Ak® X„_fc ( ^^Ak®X. 
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SnS 
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^0 £n ~ ^0 E Ak ® X. 


fc=i 
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\k=l 
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\k=l 


n—k I ^ , ^k^n—k^n 
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Using independency of from the past of the process and plugging in E = uq yields 


E 


u„u 


T 

n 



p 

/ p 

P 

/ p \T 

Cov(e„) + E 

'y ^ Ak ® X„_^ 1 

^ Afc ® Xn-k 

^ ^ -^k ^^n—k 

^ Afc ® Xn-k 


fc=i 

\k=l J 

k=l 

\k=l J 


(A2) 


As the components of £n are Poisson distributed and mutually independent, we have Cov {eu) = 
diag(ao). For the second term in (A2), we condition the difference in the expectation on the 
past of the process. Then the thinnings become the only source of randomness. Furthermore, 
the counting series of the thinnings are independent of a {Xn-i, Xn- 2 -, ■ ■ ■)■ So: 


E 


^ Afc ® X„_fc ( ^ Afc ® X„_fc - ^ Afc X„_fc ( ® X, 
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k=l 


\k=l 
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\k=l 
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J x„_fc=X„_fc,/c=l,...,p 
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x„_fc=X„_fc,fc=l,...,p 


For fixed x G Nq and {atj) := A e M>q'^ we have 


A ® X = 


( E ai j o 

i=i 


\ 


y E o xj j 


The components E^=i o ^j, i = l,...,(i, of this vector are Poisson distributed with pa¬ 
rameters Definition 2.2. As the thinnings involved are all independent by 

definition, the components are uncorrelated. Therefore, the covariance matrix of the vector is 
Cov (A ® x) = diag (A x) and 


Cov 
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E 

fc=i 


diag (Afc Xn-k) = diag 



k ^n—k 


23 



September 8, 2015 


Plugging in the random variables X„_fc and taking the expectation, we continue from (A2) and 
find 


E 




^n—k 


diag (ao) + Ediag 

^ k=l 

diag I ao + t AfcEX„ 

V k=l 

diag(ao + '^Ak(ldxd-'^Aj\ ao 


^n—k 


k=l 


P X -1 


i=i 


P X -1 


diag j I Idxd — "^0 I • 

i=i 


□ 


A.2 Proof of Theorem 3.5 

The first part of the proof largely depends on matrix manipulations. So it is important to remind 
the reader that all vectors are understood as column vectors. We rewrite the INAR(p) sequence 
(Xfc ) C Nq as a standard multivariate linear autoregressive time series with white-noise error 
sequence (ufc)fcgz := (X^ -ao -h Yfi=i ^k-i)k£i. 

p 

Xfc = ao -|- ^ ^ Ai -|-Ufc, A; G Z; 

i=i 


see Corollary 3.2. Then the distributional properties of the CLS-estimator are derived similarly 
as in Liitkepohl (2005), pages 70-75, where independent errors are assumed. In the following, let 
Z G be the design matrix from the CLS Definition 3.3 with respect to the sample 

(Xi, X 2 ,..., Xji). Furthermore, let U := (up+i, Up+ 2 ,..., u„) G . Note that Z as well 

as U depend on n. We work under the assumption that 


1 


2 ^ > : r g '^{<ip+i)x{dp+i) 


n —)• 00 , 


n — p 

exists and is invertible. In addition, we use that, for n —)■ 00 , 

1 


(A3) 


\/n-p 


vec UZ 


'Afd^p-\-d 


Zo (8)ldxd ) Uo ( ( Zo (8)lrfxd)uo 


(A4) 


where Zq := (Xlj^, XI 2 , • • •, XT^, l)"*" G has the same distribution as any of the 

columns of the design matrix Z. We postpone the reasoning for (A4) to the end of the proof. As 

a first step, weak consistency of ^ G is proven. To that aim, we will use that 


Y := (Xp+i, Xp+i,..., X„) = B Z +U (g ; (A5) 
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see Definition 3.3. 


-B = YZ^ (ZZ^ 


(A5) 
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= ' (BZ+U)Z' (ZZ 

= uz' [zz'^' ^ 

uz^ fzz 


-1 


B 


n — p \n — p 


By (A3), the second factor converges in probability to the constant matrix P. By (A4), the 
first factor has the same asymptotic distribution as W/— p where fP is a matrix consisting 
of jointly normally distributed entries not depending on n. So W/^yn — p 0(ix(dp+i) 

therefore B*' ^ — B 0£;x(dp+i)- For establishing the asymptotic distribution, we treat the 
difference of the estimated and true vectorized parameter-matrix in a similar way: 


vec ( — vec (B) = vec — B 


= vec ( U Z^ f Z Z^ 


-1 


= I (ZZ 
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® Idxd'j vec (UZ^^ 
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— p \ \n — p 
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Uxd vec 
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\/n-p 


(A6) 


In the third step of the calculation above we use that 


vec {AB) = (^B~^ (g) vec (A), (A7) 

for matrices A, B and identity matrix I such that the calculations are consistent dimen- 
sionwise; see A.12 in Liitkepohl (2005). It follows from (A6) together with (A3) that 

^/n — p ("vec ("b*'’^^') — vec (B)) has the same asymptotic distribution as 



(AS) 


With (A4), we then find that the asymptotic distribution of (A8)-and therefore of 
■y/rU^ ("vec ("b*'’^^') — vec (B)) -is centered normal with covariance matrix 



n—>oo vec 



(A4) 
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We still have to establish (A4). To that aim, we rewrite the left-hand side of (A4) as 


1 if '^~P 

—^===vec f f ^ U.j, • • •, ^ 

^ n-p 

((Zi,i U. ..., Zdp+ij U.j)) 
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^ n-p 

_ p ((Uij, • • •, Urfj) (Zij, Z 2 J,..., Zrfp+ij)^ 

1 " 
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Vn — p V / 

V ^ k=p+l 

where Z^ ;= X^_ 2 ,..., Xj_p, l)"*" G Note that, for /c G {p -|- 1,..., n}, Z^ is 

the {k — p)-th column of the design matrix Z. Now, let := vec (u^ • Zj) G /c G Z. We 

show that for the sequence (w^) C a central limit theorem for vector-valued martingale 

differences can be applied. Proposition 7.9 from Hamilton (1994) states that if (w^) C is 
such that 

(a) it defines a vector-valued martingale difference sequence, i.e., there is a filtration 
{'k(-k)k=p+i,p+ 2 ,...,n such that Wfc is ^fc-measurable and E [w^ \7ik-i] = Oj, /c G Z, 

(b) E [wfcwj] =: 5 G is a positive definite matrix independent of k, 

(c) for all ki, k 2 , ks, A :4 G Z and for all ii,..., i 4 G {1, 2,..., d}, 

Z [Wkuh Wfc2,i2 Wfc3,i3 < OO, 

where j denotes the f-th component of w^, and 

(d) E 

fc=p+i 

then, for n —)■ oo. 
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Proof of (a) Define the filtration (T-lk) by setting 

TP-k ■— u ((uj, Xj_4, Xj_2,..., ^^i—p ) • i k^ 1 k Wj . 

Then one can easily check that Wfc = vec (u^ • Zj) is PLk- measurable. It suffices to prove the 
martingale-difference property for the sequence (u^) since X^/ for k' < k and therefore Z^ are 
"T^fc-i-uieasurable. But then, because 
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we obtain the martingale difference property: 

p 

IE [ufc |^fc_i] = E Xfc — ao — ^ Ajn ^k-m T-Lk-i 

m=l 

P 

= E [Xfc 3-0 “ ^ ^ ^k—m 

m=l 

= 0d. 

Proof of (b) Independency of k follows from stationarity of (wfc). Choose k = 0. We need to 
show that, for b G 

b~^ E Wfc wj 6 = E 6"'' Wo Wq b~^ = Var ^6"'' wq^ > 0. (A9) 

With (A7), we find 

wo = vec (^uo • Zq ) = (Zo ^Idxd) vec (uq) = (Zo 'S'ldxd) uq (AlO) 

and therefore 

E wqWq =E ( Zo(8)ldxd)ufc((Zo(8)ldxd)uo)"^ 

= E (Zo (8)l(ixd) uoUg (Zo (8)ldxd) • 

To establish (A9), we define the ci-algebra 

T" := (T (X_i,..., X_p, Ai @ X_i,..., Ap ® X_p). 

Note that Zo is T'-measurable and eo is independent of T". Using these facts when considering 
the expectation of the conditional variance of b~^ wo, we obtain 

Var wo^ = E Var ^6"'' wq | + Var ^E wo | T ^ 

> E Var ^6"'' wq | 

^ E [Var (6^ (Zo ®lrfxd) uo| -U) 

= E 6"^ (Zo(8)ldxd)Cov(uo|T') (^6^ (Zo(8)ldxd)) • (All) 

Since 

p p p 

uo = Xo — ag — Ai X_j = eg + Ai ® X_j — ag — Ai X_j, (A12) 

i=l i=l i=l 

the summand eg is the only term that contributes to the conditional covariance matrix in (All)— 
the other summands in (A12) are constant with respect to F and eg is independent of F. So we 
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have Cov (uo| F) = Cov (eq | F) = Cov (eq) = diag(ao) and continuing with (All) we find 

Var (^b'^ wo^ > E b'^ ®ldx<i) diag (ao) (^b~^ <^ldxd) ) 

2 

= E ^ao,i (^6"^ (^Z([(8)lrfxd) . 

- 1=1 ’ _ 

>ao,*„E (z;[®lrfxd))". >0, (A13) 

where io G {1, 2,..., d} in (A13) is chosen in such a way that ao,jo > 0. (Remember that ao / 0^, 
by assumption.) The strict inequality in (A13) follows because, for jo G {1, 2,..., np + d} such 
that bdo / 0, we have that 

IP (Zg (8)lrfxd) . /O =P b~^ ■ (^Zq (glldxd^ , /O 

— IP[^Jo -^fcolo ^ *-*] 

= IP [I^fcolo 7^ 0] > 0) 

for some ko € E and some Zq £ {1)2,... ,d} dependent on jg. Note that denotes the Z-th 
component of X^. By stationarity, fco G E is irrelevant. And the case that Xg = 0 a.s. for some 
Zo G {1, 2,..., cZ} we have excluded, so the strict inequality follows. 

Proof of (c) Note that claim (c) follows if E • • • X^g^ig] < oo for /ci,..., Zcg G 

Z, ii,..., Zg G {1, 2,..., cZ}. The boundedness of these expectations is established for the uni¬ 
variate case in Corollary 1 of Kirchner (2015). For the multivariate case, one can argue similarly 
via the existence of the moment generating function in a neighborhood of zero. 

Proof of (d) We show that (w^w^) is ergodic. Then the claim of (cZ) follows with the Birkhoff- 
Khinchin Ergodic Theorem. The sequence (X^) can be represented as margin of a pd-dimensional 
INAR{1) sequence (X^); see Latour (1997). It is easily checked that the latter is an irreducible, 
aperiodic Markov chain on Ng*^. So (X^) is ergodic; see Durrett (1995), page 338. As margins 
of ergodic processes are ergodic, (X^) also is ergodic. As can be written as a measurable 
function of the past of (X^), (w^) is also ergodic. Finally, (w^wj) is ergodic because it is a 
measurable transformation of the ergodic sequence (w^). 

□ 
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Appendix B: Figures 
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Figure Bl. Summary of the main result of the paper. From the bivariate Hawkes model presented in Figure B2, 
around 100 000 events in each component are simulated. From this single large sample, we calculate the estimator 
from Definition 3.6. The black circles refer to estimated values of the excitement functions. The horizontal black 
dotted lines in the diagonal panels refer to the corresponding estimated baseline-intensity components. The vertical 
grey lines as well as the dotted horizontal grey lines refer to marginal 95%-conhdence intervals; see Remark 1 . 
All solid and dashed lightish-grey lines refer to the true underlying parameters; compare with Figure B2. Eyeball 
examination shows that the estimation method approximates the form of the true excitement functions well. Also 
the non-monotonicities and the jumps are reproduced. The coverage rates of the confidence intervals seem just 
about right. There is no obvious bias. For a more quantitative analysis of the estimation method; see Section 3.3 
and Figure B3. 
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(a) Model parameters of a bivariate Hawkes process. The solid lines refer to the excitement function H = (hij). 
It consists of the two selfexcitement functions /ri,i(f) = 0 and h 2 , 2 {t) = lt<7r0.25sin(t) as well as the two 
crossexcitement functions hi^ 2 (^) = li<t<30.25 and h 2 ,i{t) = 0.5(1 + The dashed lines in the diagonal 

panels refer to the two components of the baseline intensity rj = (0.5,0.25). The functions are chosen quite 
extreme for the sake of demonstration of the estimation method; see Figure Bl. 
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(b) A realization of the two components of the process starting at time 0. The vertical lines refer to the events, 
the greyish solid lines refer to the realized conditional-intensity components and the dashed lines refer to the 
baseline-intensity components. The crossexcitement from component 1 on component 2 and also the delayed 
rectangle impuls impact from component 2 on component 1 are particularly visible. 

Figure B2. Illustration of a bivariate Hawkes process as described in Section 2.1. The upper panel shows the 
model parameters. The lower panel shows a realization. 
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QQ-plot of h2,i(l) 
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Figure B3. Illustration of the simulation study described in Section 3.3. The study confirms the distributional 
properties of our estimation procedure collected in Remark 1: We simulate 2 000 times from the bivariate Hawkes 
process introduced in Figure B2. In each simulation, we realize about 5 000 events in each component. For all 
of these samples, we calculate our estimator from Definition 3.6 as well as the covariance estimator from (16). 
These calculations depend on two parameters, the support s and the bin-size A. We apply s = 6 together with a 
relatively coarse bin-size A = 0.2. The upper-row panels illustrate the estimation of /i 2 ,i(l) = 0.5(1 + !)“^ = 0.125; 
the lower-row panels illustrate the estimation of the baseline-intensity component rji = 0.5. Left column panels: 
the asymptotic normal densities around the true values (grey vertical lines) are added to the histograms. The grey 
vertical lines refer to the true values. The means of the estimates (not illustrated) would cover the true values. 
Middle column panels: the QQ-plots support the asymptotic normality result. Right column panels: the boxplots 
collect the 2 000 estimated variances; see (16). The horizontal grey lines refer to the empirical variance of the 2 000 
estimates. 
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Figure B4. The Hawkes estimator from Definition 3.6 depends on the bin size A, on the size of the sample window 
T and on the support parameter s. We examine empirically how the variances of the estimates depend on these 
three parameters. We simulate a very large sample from a univariate Hawkes process with excitement function 
h : 1 lt< 3 (l + 1)~‘^ and baseline intensity rj = 1. With respect to this single sample, we calculate the estimated 
variance for the estimates of h{l) — 0.25 (crosses) and rj = 1 (triangles) using different A, T and s; see (16). The 
solid lines in the two left panels are A i—>■ ciA“^, respectively, T i—>■ C2T~^ , for some constants ci,C 2 > 0. The 
curves fit the variance estimates of the excitement-function estimate well. In contrast, the variance of the baseline 
estimate (triangles) is relatively constant with respect to A. In the right panel, we see that the larger the support 
parameter s, the larger the variances become—this seems natural, as we estimate more parameters with respect 
to the same sample size. 


Excitation estimate with A = 0.1 Excitation estimate with A = 0.5 




Excitation estimate with A = 1 



Figure B5. Illustration of the bias/variance trade-off in the choice of the bin size A; see Section 4.2. We simulate 
100 realizations of a Hawkes process. For each of these 100 samples, we calculate the estimator from Dehnition 3.6 
with respect to three different bin-sizes A £ {0.1,0.5,1}. The estimates are collected in boxplots. The grey lines 
denote the true excitement function h{t) = exp(—l.lf). A larger A leads to a larger bias. This is particularly 
obvious in the first boxplot of the right panel. Note that the bin sizes had to be chosen quite coarse to make this 
bias visible. A smaller A leads to larger pointwise variance of the excitement function value estimates. 
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(a) We consider a univariate Hawkes process with excitement function h{t) := lt< 3 exp(—t). The true support 
of the excitement is 3 (grey vertical lines). About 40 000 events are simulated from this model. Following the 
ideas brought forward in Section 4.1, we apply automatic support-selection on this single large sample using the 
AlC-criterion with three different values for the preliminary bin-size Aq > 0. The value where the minimum 
AlC-value is attained (black vertical lines) hardly depends on Aq and though all three bin-sizes are rather coarse, 
the true support is estimated correctly up to few “A-ticks”. 
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(b) We examine the infinite support case. To that aim, we consider simulated data from three univariate Hawkes process 
with excitement functions ha{t) := exp(—at), where a = 1.1 (circles for AlC-values and solid vertical line for AlC-minimizing 
support), a = 1.5 (triangles and dashed line) and a = 2 (crosses and dotted line). The larger a, the lighter the tail of the 
function and, as desired, the smaller our estimated support; see Section 4.1. Note that the cut-off error (8) is in all three cases 
so small (< 10~3 and much less) that it will typically be negligible in comparison to the estimation standard errors. 

Figure B6. Simulation study on the choice of the support parameter of our estimator from Definition 3.6; see 
Section 4.1. Figure 6(a) illustrates the case where the true underlying support is finite and Figure 6(b) illustrates 
the case where it is infinite. 
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(a) Support analysis with respect to a 
very coarse preliminary bin-size Aq = 
Isec; see Section 4.1. The estimator 
from Definition 3.6 is calculated for 
different support candidates (in sec¬ 
onds). The corresponding AIC-values 
are calculated as in (18). We establish 
a quite short AlC-optimal support of 
the excitement function. 


(b) After the rough analysis illus¬ 
trated in Figure 7(a), we repeat the 
procedure for smaller support can¬ 
didates and with respect to a much 
finer bin-size A = 0.01 sec. We 

find an AlC-optimal support value 
of about 2.7 seconds. This value is 
stable over other choices of the bin- 
size. The attained minumum is re¬ 
markably clear-cut compared to the 
attained minimum in the simuation 
study illustrated in Figure B6. 


(c) Bin-size analysis following the 
method from Section 4.2. The base¬ 
line estimates decrease in both com¬ 
ponents as the applied bin-sizes de¬ 
crease. For A smaller than 0.01 sec, 
the decrease is of a lower magnitude 
than the 95%-confidence intervals. 
We conclude that, for A < 0.01 sec, 
the bias of our estimation method be¬ 
comes negligible. 


Figure B7. Preliminary analysis for the bivariate data example {T,C) (trades/limit orders); see Section 5.2. Our 
nonparametric Hawkes estimator from Definition 3.6 depends on a support parameter s and a bin-size parameter 
A. Applying the selection methods from Section 4.1 and Section 4.2, we find that s = 3 sec and A = 0.01 sec are 
reasonable choices. 
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(a) Bivariate fit with respect to bin size A = 0.01 sec and support s = 3 sec. For the derivation of these estimation 
parameters; see Figure B7. Eyeball examination reveals local maxima in the lower panels at half seconds. 
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(b) We fit the Hawkes model to the same sample as in (a). This time however, we ignore the best support choice and set 
it naively to s = 0.1 sec only. In addition, we apply an extremely small bin size of A = 0.002 sec. In the first milliseconds 
after each event, the results indicate an inhibitory effect; the Hawkes model does not allow for negative excitement. In 
the smoothed function-estimate of the selfexcitement of the first component (the trades process), we detect local maxima 
at 0.1 sec multiples. 

Figure B8. Exemplary Hawkes fits of the bivariate data example (T, £) described in Section 5.2 with respect to 
two sets of estimation parameters (s, A). In the fitted bivariate process, the first component refers to the trade 
times T and the second component to the limit order arrivals, respectively, cancelations C. The black solid lines 
are kernel-smoothed versions of the estimates; see the end of Section 3.2. The dotted lines in the diagonal plots 
refer to the fitted baseline-intensity components. 
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